knitr::opts_chunk$set(fig.width=10, fig.height=8, fig.path='Figs/',
                      echo=FALSE, warning=FALSE, message=FALSE,cache = TRUE)

About the dataset & Project

To start with, I am an ardent football fan: most of my free time is spent either playing or watching football. When I am doing neither, I play the PC football game FIFA.

Therefore, I have decided to use FIFA15 data as my working data.

This dataset consists of FIFA15 (a PC game) data, with 12,236 observations across 60 variables. Most of these variables describe players’ skills. The skills are grouped under five major attributes - “Speed”, “Shooting”, “Passing”, “Technical” and “Defending” - plus an “Others” group.

Besides skills, the data also has players’ names, their nationality, preferred foot and the leagues in which they play.

## 'data.frame':    12236 obs. of  60 variables:
##  $ Club.Name           : Factor w/ 579 levels "_aykur Rizespor",..: 513 364 49 166 567 568 287 522 496 89 ...
##  $ Nation              : Factor w/ 140 levels "Albania","Algeria",..: 33 135 48 133 133 133 133 91 133 133 ...
##  $ League              : Factor w/ 38 levels "Abdul Latif Jameel League",..: 6 23 21 31 4 4 27 18 31 12 ...
##  $ SearchName          : Factor w/ 11999 levels "_horarinn Ingi Valdimarsson~_h\xdcrarinn Ingi Valdimarsson",..: 1143 6 7 8 9 10 11 12 13 14 ...
##  $ position            : Factor w/ 17 levels "CAM","CB","CDM",..: 2 2 3 7 1 7 14 2 17 2 ...
##  $ Rating              : int  72 63 60 55 56 73 68 67 57 72 ...
##  $ Category            : Factor w/ 3 levels "MOTM","Regular",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ Skill               : int  2 2 2 2 2 2 3 2 3 2 ...
##  $ awr                 : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ dwr                 : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ PreferredFoot       : Factor w/ 2 levels "Left","Right": 2 2 2 2 2 1 2 2 1 2 ...
##  $ weakfoot            : int  3 3 3 4 3 3 4 3 2 3 ...
##  $ Height              : int  190 183 176 181 173 170 170 184 180 183 ...
##  $ CommonName          : Factor w/ 1526 levels "","\x8dlex Barrera",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ statsid             : int  184852 202210 206509 209024 223914 50521263 186170 140161 198543 17725 ...
##  $ Position            : Factor w/ 17 levels "CAM","CB","CDM",..: 2 2 3 7 1 7 14 2 17 2 ...
##  $ Rare                : int  1 0 0 0 0 3 0 0 1 1 ...
##  $ Speed.attributes    : int  72 61 64 54 59 82 81 39 81 32 ...
##  $ Acceleration        : int  65 63 66 54 57 80 82 37 80 33 ...
##  $ SprintSpeed         : int  77 59 63 54 61 83 80 40 81 31 ...
##  $ Shooting.attributes : int  45 34 38 32 45 59 56 33 56 34 ...
##  $ Positioning         : int  27 48 49 28 47 61 67 25 55 28 ...
##  $ Finishing           : int  30 32 31 29 45 54 52 20 57 33 ...
##  $ ShotPower           : int  78 30 51 33 52 75 60 64 55 47 ...
##  $ LongShot            : int  57 37 37 37 35 60 56 28 54 25 ...
##  $ Volleys             : int  29 33 32 29 49 30 51 47 53 32 ...
##  $ Penalties           : int  37 51 51 37 59 62 60 36 57 44 ...
##  $ Passing.attributes  : int  39 53 59 46 62 71 67 57 50 51 ...
##  $ TactAware           : int  28 54 60 36 61 67 71 53 55 40 ...
##  $ Vision              : int  28 54 60 36 61 67 71 53 55 40 ...
##  $ ShortPass           : int  48 62 64 54 67 70 66 64 50 64 ...
##  $ LongPass            : int  39 55 59 39 68 70 64 60 49 57 ...
##  $ Curve               : int  36 51 40 25 57 70 68 37 45 41 ...
##  $ Crossing            : int  36 38 56 58 56 80 67 56 49 45 ...
##  $ FKAcc               : int  26 47 42 28 49 71 62 39 49 24 ...
##  $ Technical.attributes: int  44 49 55 46 58 71 70 57 58 51 ...
##  $ BallControl         : int  48 50 62 53 61 68 68 61 56 57 ...
##  $ Dribbling           : int  36 42 45 38 56 68 69 55 56 46 ...
##  $ Balance             : int  58 67 77 55 64 93 92 44 61 59 ...
##  $ Agility             : int  52 60 72 59 56 81 81 63 71 49 ...
##  $ Reactions           : int  70 68 66 56 57 71 50 56 54 70 ...
##  $ Defending.attributes: int  72 64 52 52 44 73 32 66 36 75 ...
##  $ Marking             : int  70 66 46 55 43 78 22 66 35 78 ...
##  $ Interceptions       : int  66 65 53 46 30 67 54 61 31 77 ...
##  $ StandingTackle      : int  76 63 61 54 47 77 21 68 35 72 ...
##  $ SlideTackle         : int  70 60 56 54 59 77 21 65 34 74 ...
##  $ HeadingAcc          : int  77 61 40 50 48 55 59 67 53 72 ...
##  $ Other.attribute     : int  83 64 66 64 44 66 69 72 55 70 ...
##  $ Strength            : int  94 62 62 60 40 53 71 81 47 74 ...
##  $ Jumping             : int  47 68 52 60 56 89 71 73 57 83 ...
##  $ Aggression          : int  88 67 68 71 44 70 59 68 54 65 ...
##  $ Stamina             : int  64 65 75 66 49 84 72 59 70 62 ...
##  $ Potential           : int  72 68 68 60 64 77 72 67 64 72 ...
##  $ GKDiving            : int  12 5 10 6 13 13 15 14 6 7 ...
##  $ GKHandling          : int  7 11 13 14 10 6 10 11 10 5 ...
##  $ GKKicking           : int  14 13 15 15 6 8 11 12 6 15 ...
##  $ GKPositioning       : int  15 13 14 9 6 8 8 11 7 11 ...
##  $ GKReflexes          : int  8 14 6 12 12 11 12 10 15 10 ...
##  $ playerID            : int  184852 202210 206509 209024 223914 189615 186170 140161 198543 17725 ...
##  $ statsid.1           : int  184852 202210 206509 209024 223914 50521263 186170 140161 198543 17725 ...

UNI-VARIATE GRAPHS

It would be interesting to see the distribution of players across various leagues

lab <- levels(fifadata$League)
text(seq_along(lab), par("usr")[3] - 0.2, labels = lab, srt = 45, pos = 1, xpd = TRUE)
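A minimal sketch of how such a league distribution chart could be drawn with base R. The `league` vector is made-up toy data standing in for `fifadata$League`, and the label offset is an arbitrary choice, so treat this as an illustration rather than the code behind the original plot.

```r
# Hypothetical toy data standing in for the League column of fifadata
league <- factor(c("Liga BBVA", "Serie A", "Liga BBVA", "Ligue 1",
                   "Serie A", "Liga BBVA"))

# Count players per league and put the largest league first
league_counts <- sort(table(league), decreasing = TRUE)

# Draw the bars without the default axis labels ...
bar_pos <- barplot(league_counts, xaxt = "n", ylab = "Number of players")

# ... then add rotated labels at the bar midpoints returned by barplot()
text(x = bar_pos, y = par("usr")[3] - 0.1, labels = names(league_counts),
     srt = 45, adj = 1, xpd = TRUE)
```

Using the midpoints returned by `barplot()` keeps each label under its own bar, whatever the number of leagues.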

Sticking to leagues, let’s look at the mean rating in each league.

x <- barplot(table(fifadata$League), xaxt = "n")
labs <- paste(names(table(fifadata$League)), "")
text(cex = 1, x = x - 0.25, y = -1.25, labs, xpd = TRUE, srt = 45, pos = 2)

p2 + theme(axis.title.x = element_text(face = "bold", colour = "#990000", size = 20),
           axis.text.x = element_text(angle = 90, vjust = 0.5, size = 12))
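The mean rating per league can be computed before any plotting. The sketch below uses a small made-up `players` data frame (the ratings are invented) and `tapply()`; it is one possible way to get the numbers, not necessarily how the original chart was built.

```r
# Hypothetical toy data: a few players with their league and rating
players <- data.frame(
  League = c("Liga BBVA", "Liga BBVA", "Serie A", "Serie A", "Ligue 1"),
  Rating = c(80, 76, 74, 70, 68)
)

# Mean rating per league, highest first
mean_rating <- sort(tapply(players$Rating, players$League, mean),
                    decreasing = TRUE)
print(round(mean_rating, 1))
```

The same pattern works for the mean rating per playing position by swapping `League` for the position column.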

Another interesting observation would be to measure mean rating across multiple playing positions

We also have preferred foot - ‘Left’ or ‘Right’. Let’s distribute the players accordingly

Interestingly, there are about three times as many right-footed players as left-footed ones.
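A quick way to check such a ratio is `table()` with `prop.table()`. The `foot` vector below is invented toy data chosen to give a 3:1 split; the real column would be `fifadata$PreferredFoot`.

```r
# Hypothetical toy data standing in for fifadata$PreferredFoot
foot <- c("Right", "Right", "Left", "Right", "Right", "Left", "Right", "Right")

foot_counts <- table(foot)             # absolute counts per foot
foot_share  <- prop.table(foot_counts) # share of each foot

# Ratio of right-footed to left-footed players
ratio <- as.numeric(foot_counts["Right"]) / as.numeric(foot_counts["Left"])
print(foot_share)
print(ratio)
```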

Now distributing players according to position

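This plot gives a clear picture of the spread across positions -

* Centre Backs (CB) and Strikers (ST) are the most common playing positions
* Centre Midfielders (CM) and Goalkeepers (GK) come 3rd and 4th
* There are very few players at RF and LWB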

In this data, we have the players’ current ratings as well as their potential future ratings. It will be interesting to observe how ratings may shift in the future.

Current ratings

Potential ratings

Overlapping Current & future ratings

The potential ratings sit slightly higher than the current ones, suggesting that players are expected to improve over time.
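One simple way to overlay the two distributions is with density curves. The sketch below simulates `rating` and `potential` values, under the toy assumption that potential is never below the current rating, so it only illustrates the plotting idea and says nothing about the real data.

```r
set.seed(1)

# Hypothetical toy data: current ratings and potential ratings
rating    <- round(rnorm(200, mean = 65, sd = 7))
potential <- rating + sample(0:8, 200, replace = TRUE)  # assumption: never below current

# Overlay the two density curves on a shared x-axis
plot(density(rating), main = "Current vs potential ratings",
     xlab = "Rating", xlim = range(c(rating, potential)))
lines(density(potential), lty = 2)
legend("topright", legend = c("Current", "Potential"), lty = c(1, 2))
```

A shift of the dashed curve to the right corresponds to the "slightly higher" potential ratings described above.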

summary(fifadata)

Univariate Analysis

What is the structure of your dataset?

There are 12,236 players in the dataset with 60 features. Most of these variables describe players’ skills. The skills are grouped under five major attributes - “Speed”, “Shooting”, “Passing”, “Technical” and “Defending” - plus an “Others” group.

Besides skills, the data also has players’ names, their nationality, preferred foot and the leagues in which they play.

Other observations:

What is/are the main feature(s) of interest in your dataset?

At present, skills and positions seem to be the most interesting features. Just as in reality, different positions require different skills. Similarly, multiple factors can determine a player’s overall rating. I will explore the relation between position and skills, as well as between rating and multiple skills, in the next section.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Nationality may affect rating. For example, players from Brazil may be rated higher on average than players from some Asian nations. Similarly, preferred foot (right or left) may affect playing position, rating etc.

Did you create any new variables from existing variables in the dataset?

I’ve created a variable, ‘Mindset’, based on the players’ playing positions. This variable indicates whether a player prefers to attack or defend. More on it in the bi-variate analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

None so far. The distributions are along expected lines.

BI-VARIATE PLOTS

This data has Nationality as another feature. Let’s explore the same. Plotting players distribution across various nations

## 133 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
##      failedCodes failedCountries     
## [1,] NA          "C\x8ete d'Ivoire"  
## [2,] NA          "Cape Verde Islands"
## [3,] NA          "China PR"          
## [4,] NA          "Congo DR"          
## [5,] NA          "Cura\x8dao"        
## [6,] NA          "Holland"           
## [7,] NA          "Korea DPR"         
## 110 codes from the map weren't represented in your data

The world map shows that a large share of the players in this dataset come from Europe and South America.
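The join messages above show that a few nation names (for example "Holland" and "China PR") did not match the map database. A minimal sketch of how such names could be recoded before joining is given below. The `nation` vector is toy data, and the replacement names are my guesses at what a map database might expect, not values confirmed by the source.

```r
# Hypothetical toy data standing in for fifadata$Nation
nation <- c("Holland", "Brazil", "China PR", "Korea DPR", "Brazil")

# Lookup table: name in the data -> assumed name in the map database
name_fixes <- c("Holland"            = "Netherlands",
                "China PR"           = "China",
                "Korea DPR"          = "North Korea",
                "Congo DR"           = "Democratic Republic of the Congo",
                "Cape Verde Islands" = "Cape Verde")

# Replace only the names that appear in the lookup table
needs_fix    <- nation %in% names(name_fixes)
nation_clean <- nation
nation_clean[needs_fix] <- unname(name_fixes[nation[needs_fix]])

# Player count per (cleaned) nation, ready to be joined to a map
players_per_nation <- as.data.frame(table(nation_clean),
                                    stringsAsFactors = FALSE)
names(players_per_nation) <- c("country", "players")
print(players_per_nation)
```

The resulting data frame has one row per country, which is the shape a name-based map join (such as `joinCountryData2Map()` in the rworldmap package) typically expects.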

Exploring mean ratings across various nations

Usually we relate particular playing positions with skills like -

Let’s plot graph around these common assumptions

As reflexes increase, ratings increase.

Height, unexpectedly, is not a major factor in defenders’ ratings.

The above two graphs support our hypothesis: Vision correlates with midfielders’ ratings and shot power with strikers’ ratings.

Earlier we saw that height does not correlate with defenders’ ratings. Another important factor while defending can be the strength of the player. The above graph shows this relation. Barring a few anomalies, there is a positive correlation between strength and rating.
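The visual impression from these scatter plots can be backed by a correlation coefficient computed on the defenders only. The sketch below simulates a toy `players` data frame in which rating is built from strength and not from height; that construction is purely an assumption to make the example self-contained.

```r
set.seed(42)
n <- 300

# Hypothetical toy data: position, height (cm) and strength
players <- data.frame(
  Position = sample(c("CB", "ST", "CM"), n, replace = TRUE),
  Height   = round(rnorm(n, mean = 181, sd = 6)),
  Strength = round(runif(n, min = 40, max = 90))
)
# Toy assumption: rating depends on strength, not on height
players$Rating <- round(40 + 0.4 * players$Strength + rnorm(n, sd = 3))

# Restrict to centre backs, then correlate each feature with rating
defenders    <- subset(players, Position == "CB")
cor_height   <- cor(defenders$Height, defenders$Rating)
cor_strength <- cor(defenders$Strength, defenders$Rating)

print(round(c(height = cor_height, strength = cor_strength), 2))
```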

Next, let’s look at players from different positions and their preferred foot.

The mosaic plot, though cluttered, gives a few ideas about positions like -

The above results are along expected lines, so I won’t investigate them further.

Now coming back to skills of the players. Let’s explore some skills and their relation with the players ratings

Speed does not seem to be strongly correlated with ratings.

Similarly, height and ratings show no correlation.

The above two graphs point to the importance of other skills like shooting, passing, heading etc. in developing a player.

In this dataset we have a feature called ‘Skill’. This feature measures the players’ ability, on a scale of 1 to 5, to perform acrobatic skills: 1 means very basic skills and 5 means very acrobatic moves such as a bicycle kick.

In the following plot, we will see impact of skill on ratings

Sticking to ‘skill’ feature, we will try to see the skill distribution across various positions

* As expected, attacking positions like strikers, wingers and attacking midfielders have 5-star skill players
* Goalkeepers, as expected, have 1-star skills
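The two observations above can be checked with a simple cross-tabulation of position against skill stars. The data frame below is invented toy data with a handful of players; the column names mirror `Position` and `Skill` from the dataset.

```r
# Hypothetical toy data: position and skill stars (1 to 5)
players <- data.frame(
  Position = c("ST", "ST", "LW", "CAM", "CB", "CB", "GK", "GK"),
  Skill    = c(5, 4, 5, 4, 2, 2, 1, 1)
)

# Cross-tabulate positions (rows) against skill stars (columns)
skill_by_position <- table(players$Position, players$Skill)
print(skill_by_position)

# Highest skill level found at each position
max_skill <- tapply(players$Skill, players$Position, max)
print(max_skill)
```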

It would be interesting to observe rating distribution among various positions

We have seen that height does not have much impact on the overall rating of the players. However, there is one area where height should be useful: heading. This is depicted in the above plot, where height and heading ability are strongly positively correlated.

Before ending this Bi-variate analysis, we will quickly observe certain skills specific to positional attributes

Sprint speed is highly correlated with the speed attribute.

Finishing seems to be an important aspect of shooting.

Positioning appears to be another factor for strikers when they are shooting.

Penalties, as expected, are only loosely correlated with shooting ability.

Short passes are, strangely, not that important in passing.

Long passes are an important feature of passing.

Crossing is another important aspect of passing.

Dribbling seems to form the heart of a player’s technical ability.

Reactions are almost unrelated to technical ability.

Ball control is another important aspect.

Marking, as depicted above, is an important factor in defending, just like in real life.

Standing tackles also appear to be an indispensable factor.

Slide tackles are important too, but strangely less correlated than standing tackles.

Reflexes seem to be important for goalkeepers while diving.
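Comparisons like the ones above can be summarised by ranking each sub-skill by its correlation with the group attribute. The sketch below does this for shooting on simulated data. The weights used to build `Shooting.attributes` are invented for illustration and are not the game’s actual formula.

```r
set.seed(7)
n <- 500

# Hypothetical toy data for three shooting sub-skills
skills <- data.frame(
  Finishing   = runif(n, 20, 90),
  Positioning = runif(n, 20, 90),
  Penalties   = runif(n, 20, 90)
)
# Toy assumption: the group score is a weighted mix of the sub-skills
skills$Shooting.attributes <- 0.6 * skills$Finishing +
  0.3 * skills$Positioning + 0.1 * skills$Penalties + rnorm(n, sd = 2)

# Correlation of each sub-skill with the group attribute, strongest first
sub_skills <- c("Finishing", "Positioning", "Penalties")
cors <- sapply(sub_skills,
               function(s) cor(skills[[s]], skills$Shooting.attributes))
ranked <- sort(cors, decreasing = TRUE)
print(round(ranked, 2))
```

The same loop could be pointed at the passing, technical or defending sub-skills by changing `sub_skills` and the target column.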

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Different positions demand different skill sets.

Positions wise analysis -

Shooting -

Passing -

Technical abilities -

Defending -

Overall Rating -

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

MULTI-VARIATE PLOTS

As we have seen, many skills can contribute to a player’s overall rating. Let’s establish the major attributes of a striker.

Finishing, Positioning and shooting appear to be correlated with strikers’ ratings, but league has little impact on them. This suggests that good strikers can be found in the weaker leagues as well - the so-called hidden gems.

Trying something similar for the passing ability of a player.

The above corrgram plot does not show a strong correlation in terms of overall passing ability, but individual skills are strongly correlated with each other - TactAware and Vision, ShortPass and LongPass. This demonstrates that skills can be interdependent.
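The numbers behind a corrgram are just a correlation matrix, which base R’s `cor()` produces directly. In the sketch below, the `passing` data frame is simulated: `LongPass` is made to track `ShortPass` while `Crossing` is independent, purely as a toy assumption.

```r
set.seed(3)
n <- 400

# Hypothetical toy data for three passing skills
short_pass <- runif(n, 30, 90)
passing <- data.frame(
  ShortPass = short_pass,
  LongPass  = short_pass + rnorm(n, sd = 5),  # assumption: tracks ShortPass
  Crossing  = runif(n, 30, 90)                # assumption: independent
)

# Pairwise correlation matrix, rounded for readability
cor_matrix <- round(cor(passing), 2)
print(cor_matrix)
```

Pairs of skills that move together show up as off-diagonal entries close to 1.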

We have also seen the major playing attributes - Speed, Shooting, Passing, Technical, Defending. The following plot shows the relation between these attributes and the players’ overall ratings.

Shooting, Passing and Technical attributes seem to be the major contributors, while Defending, unfortunately, is only loosely correlated with overall rating.

Based on playing position, a player can contribute to the attack or the defense of the team. The following code creates a new variable, ‘mindset’, to segregate players into ‘attack’ or ‘defense’ based on their playing positions.

In the same code, we have selected the top 5 leagues, based on mean ratings and the players playing in those leagues (one each from England, Spain, Germany, France and Italy), as ‘Topleagues’. This will help us with focused visualization.
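Since the code itself is not shown in this report, here is a minimal sketch of how the ‘mindset’ variable and the ‘Topleagues’ subset could be built. The list of attacking positions and two of the league names ("Barclays PL", "Serie A") are my assumptions for illustration; the real grouping may differ.

```r
# Hypothetical toy data: position and league of six players
players <- data.frame(
  Position = c("ST", "CB", "CAM", "GK", "LW", "CDM"),
  League   = c("Liga BBVA", "Serie A", "Barclays PL", "Ligue 1",
               "MLS", "Bundesliga"),
  stringsAsFactors = FALSE
)

# Assumed grouping: which positions count as attacking
attack_positions <- c("ST", "CF", "LF", "RF", "LW", "RW", "CAM", "LM", "RM")
players$mindset <- ifelse(players$Position %in% attack_positions,
                          "attack", "defense")

# Assumed names of the five top leagues
top_leagues <- c("Barclays PL", "Liga BBVA", "Bundesliga", "Ligue 1", "Serie A")
Topleagues  <- players[players$League %in% top_leagues, ]

print(Topleagues)
```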

The above dot plot shows the teams of the top 5 leagues and their mean rating across ‘Attack’ and ‘Defence’.

The following graph explores the major playing attributes across the top 5 leagues.

The above star plot shows the superiority of Liga BBVA (the Spanish league) over the other 4 leagues in almost all departments.
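A star plot of this kind can be sketched with `aggregate()` and base R’s `stars()`. The attribute values below are invented toy numbers for three leagues, so the shapes drawn here do not reflect the real comparison.

```r
# Hypothetical toy data: two players per league, five attribute scores each
players <- data.frame(
  League               = rep(c("Liga BBVA", "Bundesliga", "Ligue 1"), each = 2),
  Speed.attributes     = c(70, 74, 68, 70, 66, 68),
  Shooting.attributes  = c(62, 66, 60, 62, 56, 58),
  Passing.attributes   = c(68, 70, 64, 66, 60, 62),
  Technical.attributes = c(70, 72, 66, 68, 62, 64),
  Defending.attributes = c(58, 60, 62, 64, 54, 56)
)

# Mean of every attribute per league
league_means <- aggregate(. ~ League, data = players, FUN = mean)
rownames(league_means) <- league_means$League

# One star per league; attributes run from 0 to 100, so rescale to [0, 1]
stars(league_means[, -1] / 100, scale = FALSE,
      main = "Mean attributes by league (toy data)")
```

Rescaling by a fixed 100 rather than per column keeps the stars comparable across attributes as well as across leagues.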

‘Aggression’ and ‘Skill’ are both player attributes that are tough to measure. Let’s see their impact on Rating.

5-star skill players (yellow) are at the higher end of the ratings, while nothing definite can be said about aggression’s impact on ratings.

The above graph measures the role of vision and standing tackle in forming the players’ Defending attribute. As depicted in the graph, vision does not have much say in the defending attribute, while standing tackle is an important aspect of it.

Aggression does not seem to have a high impact on a player’s passing attribute, but it plays an important role in forming the defending attribute.

Goalkeeper diving, reflexes and positioning are correlated, understandably.

There’s another variable, ‘Work rate’, which measures how hard-working a player is (on a scale of 1 to 3). It is further divided into Defensive Work Rate (DWR) and Attacking Work Rate (AWR).

The above graph relates DWR to the defending attribute and ratings. Yellow dots (high DWR) are mostly at the higher end of the defending attribute but scattered along the rating axis. This suggests that DWR has some impact on the defending attribute, but not on rating.

Just like the previous graph, this one compares AWR with the Shooting attribute and Rating. Unexpectedly, the yellow dots (high AWR) are scattered here, suggesting that AWR has almost no impact on ratings or shooting ability.

A player’s passing ability is determined by a number of factors. Pass type, short or long, is one of those criteria. Though this helix graph is tough to comprehend, the major trend is that long passing has much more impact than short passing on a player’s passing ability.

We have not explored one variable so far: ‘Rare’. This variable depicts how rare a player is (on a scale of 1-9). The rarity is defined by the player’s combination of existing skills.

The above plot is an extension of the bi-variate graph where we measured skill across positions. In this graph we add a third variable, Rare (shown by point size). Interestingly, 5-star skill players are not that rare, suggesting that acrobatic skills are easier to possess. In terms of position, rare players are seen mostly in attack - strikers, wingers, attacking midfielders.

Modeling

Before ending my analysis, I would like to create two models: a linear model to predict a new player’s rating from his skill values, and a clustering model to group similar types of players.

## 
## Call:
## lm(formula = Rating ~ GKHandling, data = gk)
## 
## Coefficients:
## (Intercept)   GKHandling  
##     11.6304       0.8432
## 
## Call:
## lm(formula = Rating ~ GKHandling + GKDiving + GKReflexes, data = gk)
## 
## Coefficients:
## (Intercept)   GKHandling     GKDiving   GKReflexes  
##     -0.4891       0.3925       0.2912       0.3118
## 
## Call:
## lm(formula = Rating ~ GKHandling + GKDiving + I(GKReflexes^2), 
##     data = gk)
## 
## Coefficients:
##     (Intercept)       GKHandling         GKDiving  I(GKReflexes^2)  
##       10.079227         0.390715         0.287094         0.002347
##   GKHandling GKDiving GKReflexes GKPositioning predictions.fit
## 1         88       75         82            84        81.71997
## 2         76       66         75            77        72.79115
## 3         68       58         71            70        66.01376
##   predictions.lwr predictions.upr
## 1        80.51953        82.92041
## 2        71.59311        73.98918
## 3        64.81453        67.21300

In the above example, I created a model to predict a goalkeeper’s rating from the values of some goalkeeper-specific skills: handling, positioning, reflexes and diving.
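The output above suggests a workflow of fitting `lm()` and then calling `predict()` with an interval. A minimal sketch of that workflow follows. The `gk` data frame is simulated with invented weights, the three new goalkeepers reuse the skill values from the prediction table above, and the choice of `interval = "prediction"` is my assumption, since the report does not say which interval type was used.

```r
set.seed(10)
n <- 200

# Hypothetical toy data standing in for the goalkeeper subset `gk`
gk <- data.frame(
  GKHandling = round(runif(n, 40, 90)),
  GKDiving   = round(runif(n, 40, 90)),
  GKReflexes = round(runif(n, 40, 90))
)
# Toy assumption: rating is a weighted mix of the three skills plus noise
gk$Rating <- round(0.4 * gk$GKHandling + 0.3 * gk$GKDiving +
                   0.3 * gk$GKReflexes + rnorm(n, sd = 1.5))

# Fit the same formula as the second model in the output above
model <- lm(Rating ~ GKHandling + GKDiving + GKReflexes, data = gk)

# Skill values taken from the prediction table above
new_keepers <- data.frame(GKHandling = c(88, 76, 68),
                          GKDiving   = c(75, 66, 58),
                          GKReflexes = c(82, 75, 71))

# Assumption: prediction intervals; "confidence" would give narrower bands
predictions <- predict(model, newdata = new_keepers, interval = "prediction")
print(cbind(new_keepers, predictions))
```

The `fit`, `lwr` and `upr` columns correspond to the point prediction and the lower and upper bounds of the interval.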

It’s always an interesting task to unearth players who are similar to other top players. The following clustering model groups top strikers, midfielders, defenders and goalkeepers. By top, I mean those with an overall rating of 80 or above.
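The report does not say which clustering method was used, so the sketch below uses hierarchical clustering (`hclust()`) as one reasonable option. The players, their names and their attribute values are all invented; only the "rating of 80 or above" filter comes from the text.

```r
# Hypothetical toy data: six players with rating and three attributes
players <- data.frame(
  Name   = c("Striker A", "Striker B", "Defender A", "Defender B",
             "Midfielder A", "Reserve"),
  Rating = c(88, 86, 85, 84, 82, 75),
  Speed.attributes     = c(90, 88, 60, 62, 75, 70),
  Shooting.attributes  = c(85, 87, 40, 38, 70, 60),
  Defending.attributes = c(30, 32, 85, 88, 60, 50),
  stringsAsFactors = FALSE
)

# Keep only top players (overall rating of 80 or above)
top <- subset(players, Rating >= 80)

# Standardise the attributes so each one contributes equally to distances
features <- scale(top[, c("Speed.attributes", "Shooting.attributes",
                          "Defending.attributes")])
rownames(features) <- top$Name

# Hierarchical clustering, cut into three groups
tree     <- hclust(dist(features))
clusters <- cutree(tree, k = 3)
print(clusters)
```

Players that land in the same cluster have similar attribute profiles, which is the basis for finding look-alikes of established stars.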

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Finishing, Positioning and shooting appear to be correlated with strikers’ ratings, but league has little impact on them.

Shooting, Passing and Technical attributes seem to be the major contributors to overall rating, while Defending, unfortunately, is only loosely correlated with it.

Vision does not have much say in the defending attribute, while standing tackle is an important aspect of it.

Aggression does not seem to have a high impact on a player’s passing attribute, but it plays an important role in forming the defending attribute.

Were there any interesting or surprising interactions between features?

Aggression does not seem to have a high impact on a player’s passing attribute, but it plays an important role in forming the defending attribute. This parallels real-life football, as defenders are pretty aggressive while tackling and winning the ball, whereas midfielders rely more on technical skills.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a general linear model to predict a goalkeeper’s rating from the values of some goalkeeper-specific skills: handling, positioning, reflexes and diving.

Also, position-specific clusters were created to pair similarly styled players.

Final Plots and Summary

Plot One

Why this plot - To begin any analysis, one should be aware of the important variables in the dataset. In this dataset, most of the variables describe skills, which can be explored as we go. Another important aspect is the playing position of the player. In the game, as well as in real life, it is important to have players playing in their preferred positions, so knowledge about positions is indispensable for any manager. Thus, I chose this graph to measure the player count across positions.

Key takeaways - This plot gives a clear picture of the spread across positions -

* Centre Backs (CB) and Strikers (ST) are the most common playing positions
* Centre Midfielders (CM) and Goalkeepers (GK) come 3rd and 4th
* There are very few players at RF and LWB

Personal Opinion - This dataset matches real-world databases of football players, where 30% of the players are either strikers or centre backs. The third place of centre midfield is again a reflection of real-world position preferences. For more specialised positions such as LWB and RWB, the count decreases drastically. However, unexpectedly, we have just 1 right forward and 3 left forwards.

Plot Two

Why this plot - Most of a player’s features (playing skills) are reflected in his/her game, and the statistics / ratings bear this out. For example, ‘Finishing’ is an important aspect for a striker as well as for the overall rating of the player. However, some aspects are not reflected in the stats but show in the player’s style and can well decide the fate of a match. ‘Aggression’ is one such factor, as it can come in handy when winning the ball from the opponent. However, over-aggressive players can commit fouls, resulting in a sending-off. I therefore selected this plot to measure how aggression is related to the skills.

Key takeaways - Aggression does not seem to have a high impact on a player’s passing attribute, but it plays an important role in forming the defending attribute.

Personal Opinion - The defending attribute is composed of tackling, marking, interceptions etc., and these require a bit of aggression, as the player has to win the ball from the opponent. Passing, on the other hand, is more about a varied range of passes, crosses, vision etc. and thus depends more on ball control than on aggression. The graph therefore aptly shows aggression concentrated in the region of high defending attribute.

Plot Three

Why this plot - Time and again, we compare leagues - sometimes to show the financial disparity between them and sometimes to show differences in playing attributes. Mostly the comparison is based on the top stars of each league. In this graph, however, I wanted to compare the mean scores of the top 5 leagues to measure their playing abilities more fairly. To that end, the graph covers 5 major playing parameters - Speed, Shooting, Passing, Technical, Defending.

Key takeaways - The star plot shows the superiority of Liga BBVA (the Spanish league) over the other 4 leagues in almost all departments. In defense, the Bundesliga (the German league) is better than the other 4 leagues. Ligue 1 (the French league) is the poor cousin of the group and sits below the other leagues.

Personal Opinion - The Spanish league, with highly skilled players like Messi, Ronaldo and Neymar, definitely looks better than the other leagues, and this is aptly shown in the graph. The German league, despite lower star power and viewership, performs well in this plot, which is consistent with the regular presence of German teams in European competitions. Germany’s World Cup victory further supports the strength of the German league. The French league, owing to its financial crisis, is rightly shown below the others.

Miscellaneous Plots

Plot4

These plots are polished and informative for understanding the data, so I have included them alongside the 3 top plots.

Plot5

We have selected the top 5 leagues, based on mean ratings and the players playing in those leagues (one each from England, Spain, Germany, France and Italy), as ‘Topleagues’. This will help us with focused visualization.

The above dot plot shows the teams of the top 5 leagues and their mean rating across ‘Attack’ and ‘Defence’.

Reflection

General

The FIFA15 data set contains information on about 12,000 players across the world who are part of EA Sports’ FIFA15 game. I started this project by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Existing knowledge of football also helped me in grasping the skills. Some of the results are along expected lines while others are surprising.

Interesting pointers

Nationality appears to be an interesting factor in determining player counts as well as mean ratings. South American nations such as Brazil and Argentina dominate the ratings. Interestingly, African nations such as Ghana and Ivory Coast are also at the high end of the rating spectrum. The lowest national rating is 55.3 while the highest is 76, so there is still quite a disparity.

Finishing seems to be a strong variable for shooting. Similarly, crossing appears to be positively correlated with passing ability, and marking with defending ability. Reflexes are an important skill for goalkeepers.

Surprises

Defending ability is not strongly correlated with ratings.

Sprint speed does not appear to have any impact on ratings.

Challenges

Some players play in multiple positions, and it was hard to drop them (or any particular position of theirs). While plotting the map, some nation names did not match the world map database, so I altered the names.

Modelling

Clustering players can be very useful, since in a real or virtual game we always want a cheaper player with huge potential. Similar playing skills can at times unearth future stars.

Next steps

I would like to build a model to predict the chemistry (understanding) between players. Chemistry can depend on players belonging to the same nation, same style, same league etc.